Exploratory Data Analysis (EDA)

Analyzing data sets to summarize the main characteristics of their variables, often with visual graphs and without fitting a statistical model.

1. Overview of the data

Understanding the dimensions of the dataset, the variable names, the overall missing-value summary, and the data type of each variable.

# Overview of the data
ExpData(data=data,type=1)
# Structure of the data
ExpData(data=data,type=2)
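As a rough illustration of what such an overview contains, the same kind of information can be assembled in base R. This is a minimal sketch, not the report's own implementation: the built-in `airquality` data set stands in for `data` (which is not shown here), and the column names of `structure_tbl` are chosen for illustration only.

```r
# Stand-in data set (assumption): the report's own `data` object is not shown.
df <- airquality

# Data-set level overview: size, numeric columns, complete rows.
overview <- list(
  n_rows     = nrow(df),
  n_cols     = ncol(df),
  n_numeric  = sum(vapply(df, is.numeric, logical(1))),
  n_complete = sum(complete.cases(df))
)

# Variable-level structure: type, missing values, distinct values.
structure_tbl <- data.frame(
  variable    = names(df),
  type        = vapply(df, function(x) class(x)[1], character(1)),
  n_missing   = vapply(df, function(x) sum(is.na(x)), integer(1)),
  pct_missing = round(100 * colMeans(is.na(df)), 2),
  n_distinct  = vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1)),
  row.names   = NULL
)

print(overview)
print(structure_tbl)
```

Columns with a high `pct_missing` or a single distinct value are the usual candidates for closer inspection before any further summary.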

Target variable

Summary of categorical dependent variable

  1. Variable name - SERVICES
  2. Variable description - Developmental Services Survey Dataset

2. Summary of numerical variables

Summary of all numerical variables

Summary statistics when the dependent variable, SERVICES, is categorical. The statistics are split by category level.

ExpNumStat(data,by="GA",gp=Target,Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
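To make the "group and overall" idea concrete, here is a minimal base R sketch that computes a few statistics per category and for the whole sample. It assumes the built-in `iris` data with `Species` as a hypothetical target; the helper `num_summary`, the moment-based skewness, and the 1.5 x IQR outlier rule are illustrative choices and may differ from the definitions the report's function uses.

```r
df <- iris
target <- "Species"   # hypothetical target column, for illustration only

# Summary statistics for one numeric vector.
num_summary <- function(x) {
  q   <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[3] - q[1]
  m   <- mean(x, na.rm = TRUE)
  c(n         = sum(!is.na(x)),
    mean      = m,
    sd        = sd(x, na.rm = TRUE),
    median    = q[2],
    # Moment-based skewness: third central moment over variance^1.5.
    skewness  = mean((x - m)^3, na.rm = TRUE) / mean((x - m)^2, na.rm = TRUE)^1.5,
    # Values beyond 1.5 * IQR from the quartiles are counted as outliers.
    n_outlier = sum(x < q[1] - 1.5 * iqr | x > q[3] + 1.5 * iqr, na.rm = TRUE))
}

# One row per category, plus an "All" row for the full sample.
by_group <- lapply(split(df$Sepal.Length, df[[target]]), num_summary)
result   <- round(do.call(rbind, by_group), 2)
overall  <- round(num_summary(df$Sepal.Length), 2)
print(rbind(result, All = overall))
```

Comparing the per-category rows with the `All` row shows quickly whether a numeric variable separates the categories of the target.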

3. Distributions of numerical variables

  • Box plots for all numerical variables vs categorical dependent variable - Bivariate comparison only with categories

    • Quantile-quantile plot (Univariate)
    • Density plot (Univariate)
    • Box plot (Univariate and Bivariate)
    • Scatter plot (Bivariate)

Quantile-quantile plot for Numerical variables - Univariate

Quantile-quantile plot for all Numerical variables

ExpOutQQ(data,nlim=4,fname=NULL,Page=c(2,2),sample=sn)
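For readers without the plots at hand, a comparable 2 x 2 page of normal quantile-quantile plots can be drawn with base R graphics. This sketch uses four columns of the built-in `mtcars` data as stand-ins; the variable choice is an assumption made only for illustration.

```r
# Stand-in numeric variables (assumption): four columns of mtcars.
num_vars <- c("mpg", "disp", "hp", "wt")

old_par <- par(mfrow = c(2, 2))          # 2 x 2 grid of panels
qq <- lapply(num_vars, function(v) {
  out <- qqnorm(mtcars[[v]], main = v)   # sample vs theoretical normal quantiles
  qqline(mtcars[[v]], col = "red")       # reference line through the quartiles
  out
})
par(old_par)                             # restore the previous layout
names(qq) <- num_vars
```

Points that bend away from the reference line at either end indicate heavier or lighter tails than a normal distribution.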

Density plots for Numerical variables - Univariate

Density plot for all Numerical variables

ExpNumViz(data,target=NULL,type=1,Page=c(2,2),theme=theme,sample=sn)
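A univariate density plot of this kind can be approximated in base R with a kernel density estimate. The sketch below assumes the built-in `faithful` data as a stand-in and uses the defaults of `density()` (Gaussian kernel, automatic bandwidth), which are not necessarily the settings behind the report's plots.

```r
x <- faithful$eruptions          # stand-in numeric variable (assumption)

d <- density(x)                  # Gaussian kernel, default bandwidth
plot(d, main = "Density of eruptions", xlab = "eruptions")
rug(x)                           # mark the individual observations on the axis
```

A second mode in the curve, as opposed to a single peak, is a hint that the variable mixes two subpopulations.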

Scatter plot for all numeric features - Bivariate analysis

ExpNumViz(data,target=Target,Page=c(2,1),theme=theme,sample=sn,scatter=TRUE)
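A single bivariate scatter plot, colored by the categories of a target, can be sketched in base R as follows. The built-in `iris` data and the chosen pair of variables are assumptions for illustration; the Pearson correlation is added only as a numeric companion to the picture.

```r
# Stand-in data (assumption): iris, with Species playing the role of the target.
plot(iris$Sepal.Length, iris$Petal.Length,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Sepal.Length", ylab = "Petal.Length")
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)

# Pearson correlation of the plotted pair.
r <- cor(iris$Sepal.Length, iris$Petal.Length)
print(round(r, 3))
```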

Box plots for all numeric features vs categorical dependent variable - Bivariate comparison only with categories

Boxplot for all the numeric attributes by each category of SERVICES

ExpNumViz(data,target=Target,type=2,theme=theme,Page=c(2,2),sample=sn)
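The same comparison can be sketched for one variable with the base R `boxplot()` formula interface. The built-in `iris` data stands in for the report's data, with `Species` as a hypothetical target.

```r
# One box per category of the (hypothetical) target.
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              main = "Sepal.Length by Species", ylab = "Sepal.Length")

# bp$stats has one column per category; rows are lower whisker,
# first quartile (hinge), median, third quartile (hinge), upper whisker.
print(bp$stats)
```

Boxes whose interquartile ranges barely overlap suggest that the numeric variable discriminates well between the categories.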

4. Summary of categorical variables

Summary of categorical variables

Cross tabulation with target variable

  • Custom tables between all categorical independent variables and the target variable SERVICES
ExpCTable(data,Target=Target,margin=1,clim=10,nlim=5,round=2,bin=NULL,per=FALSE)
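A cross tabulation of one categorical variable against a target, with counts and row-based proportions, can be sketched in base R. The built-in `mtcars` data is an assumption here: `cyl` plays the categorical predictor and `am` a hypothetical binary target.

```r
# Counts: rows are categories of the predictor, columns are target classes.
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
print(addmargins(tab))                       # counts with row and column totals

# Row-based percentages (each row sums to 100).
row_pct <- round(100 * prop.table(tab, margin = 1), 2)
print(row_pct)
```

Using `margin = 2` instead would give column-based proportions, that is, the distribution of the predictor within each target class.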

Information Value

ExpCatStat(data,Target=Target,Label=label,result = "IV",clim=10,nlim=5,Pclass=Rc)
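For reference, the information value of a categorical predictor against a binary target is commonly defined through the weight of evidence (WOE) of each category. The sketch below implements that textbook definition in base R; the helper `info_value` is hypothetical, `mtcars` is a stand-in data set, and details such as binning or the handling of empty cells may differ from what the report's function does.

```r
# Textbook WOE / IV for a categorical predictor x and a binary target y (0/1).
# No smoothing is applied, so a category with zero events or zero non-events
# would produce an infinite WOE.
info_value <- function(x, y) {
  tab           <- table(x, y)
  pct_non_event <- tab[, "0"] / sum(tab[, "0"])   # share of non-events per category
  pct_event     <- tab[, "1"] / sum(tab[, "1"])   # share of events per category
  woe <- log(pct_event / pct_non_event)
  iv  <- sum((pct_event - pct_non_event) * woe)
  list(woe = woe, iv = iv)
}

res <- info_value(mtcars$cyl, mtcars$am)   # stand-in predictor and target
print(round(res$woe, 3))
print(round(res$iv, 3))
```

Each term of the sum is non-negative, so the information value grows with every category whose event share differs from its non-event share.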

Statistical test

ExpCatStat(data,Target=Target,Label=label,result = "Stat",clim=10,nlim=5,Pclass=Rc)
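A typical statistical test for the association between a categorical predictor and a categorical target is Pearson's chi-squared test, often reported together with Cramér's V as an effect size. The following base R sketch shows both on the stand-in `mtcars` data; it is an illustration of the general technique, and the exact set of statistics in the report's output is not reproduced here.

```r
tab <- table(mtcars$cyl, mtcars$am)        # stand-in predictor and target

# Pearson chi-squared test of independence. Small expected counts make R
# warn that the approximation may be inaccurate; the warning is silenced here.
test <- suppressWarnings(chisq.test(tab))

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), between 0 and 1.
n         <- sum(tab)
cramers_v <- sqrt(unname(test$statistic) / (n * (min(dim(tab)) - 1)))

print(test)
print(round(cramers_v, 3))
```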

Variable importance based on Information value

varimp <- ExpCatStat(data,Target=Target,result = "Stat",clim=10,nlim=5,Pclass=Rc,bins=10,plot=TRUE,top=30,Round=2)

5. Distributions of categorical variables

Graphical representation of all categorical variables

  • Bar plot (Univariate)
  • Stacked Bar plot (Bivariate)

Bar plots for all categorical variables

  • Bar plot with vertical or horizontal bars for all categorical variables
ExpCatViz(data,target=NULL,fname=NULL,clim=10,margin=2,theme=theme,Page = c(2,1),sample=sc)

  • Stacked bar plot with vertical or horizontal bars for all categorical variables
ExpCatViz(data,target=Target,fname=NULL,clim=10,margin=2,theme=theme,Page = c(2,1),sample=sc)
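Both kinds of bar plot can be sketched with base R graphics. The example assumes the built-in `mtcars` data, with `cyl` as the categorical variable and `am` as a hypothetical target; percentages in the stacked plot are column-based, so every bar sums to 100.

```r
# Univariate bar plot: percentage of observations in each category.
counts <- table(mtcars$cyl)
barplot(100 * prop.table(counts), main = "cyl", ylab = "Percent")

# Stacked bar plot: target classes stacked within each category.
tab     <- table(am = mtcars$am, cyl = mtcars$cyl)
col_pct <- 100 * prop.table(tab, margin = 2)   # column-based percentages
barplot(col_pct, legend.text = TRUE,
        main = "am within cyl", xlab = "cyl", ylab = "Percent")
```

Passing `horiz = TRUE` to `barplot()` draws the same bars horizontally.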